usefulness of so-called "array processors", combined with mini- and
superminicomputers, in the analysis of computationally demanding
engineering problems, such as in finite element applications.

An array processor, also called an attached processor, is a high
speed special purpose computational device, which can perform
repetitive computations on well structured data sets at effective
speeds far beyond those achieved by minis and superminis. Although
their speeds are substantially below the newest generations of large
scale pipeline and parallel vector computers, they do provide a clear
cost/performance advantage. Furthermore, the combination of a
minicomputer and an array processor provides a high degree of
interactivity in addition to attractive computational speeds.

Using an array processor/minicomputer combination effectively is a
difficult task, requiring not only knowledge of pertinent algorithms
and computer programming, but also a clear understanding of the
hardware and the details of its operation. The objective is to produce
efficient implementations of practical algorithms. These
implementations, however, should eventually be user transparent, in
order to be of benefit to the engineering community.

The array processor/minicomputer combination is a special case of a
multiprocessor system in which only two processors, with widely
differing characteristics, are involved. One interesting factor here
is that the communication between the two processors leaves a
substantial amount of control in the hands of the programmer.
Experimentation with such a system provides an insight into the more
general area of parallel processing.

The issues here relate to the synchronization of computations and
data transfers, as well as the design of operating systems and
applications programs to handle the solution of complex problems.

Several operations typical of finite element analysis were chosen as
a basis for performance measurements. Some experiments measured the
speeds of execution on the host computer and the array processor,
acting as separate processors. Other measurements were designed to
assess the performance of the integrated system. In addition to these
measurements, simulation was used to predict the performance. The
simulation results are matched to the measured results in order to
gain confidence in the simulation procedures. Once this has been
accomplished, the same methods are used to predict the performance
more generally.

Measurements were taken for simple matrix operations, numerical
integration of solid element properties, and the decomposition of a
hypermatrix. In the case of the computation of stiffness matrices for
the solid element, several alternative strategies were attempted, some
of which were designed to minimize data transfers between the host and
the array processor.

Finally, the experience gained so far is used to examine the
possibility of implementing several algorithms typical of [...] the
effectiveness of the system in execution of such computations is made.

SUMMARY

The paper describes ongoing efforts to investigate the usefulness of
so-called "attached processors" or "array processors", combined with
mini- and superminicomputers, in the analysis of computationally
demanding engineering problems, such as finite element nonlinear
applications. A close examination of some basic algorithms, which play
an important part in such analysis, is presented. Furthermore, a
methodology is developed, which forms a useful guide to further efforts
of this nature, including the more general case of multiprocessing.
Emphasis is placed on the ability to predict the performance of
algorithms from time measurements of certain basic operations, coupled
with an understanding of the characteristics of the hardware in
question and the interplay between its various components. Four
hardware alternatives are considered, two of which have attached array
processors. Of the algorithms considered, it appears that the
decomposition process is the one which benefits most from the presence
of an array processor. The speed advantage gained in stiffness
assembly, and forward and backward substitution, may both be obtained
by purchasing a more powerful superminicomputer.


INTRODUCTION 

The advent of "super computers", such as the STAR and CRAY series,
and the HEP computer [1], provides an undoubted breakthrough for large
scale finite element analysis [2]. The availability of inexpensive
microprocessors has triggered efforts to devise microcomputer networks,
using parallel and distributed processing to provide faster solution of
finite element problems [3,4]. Finite element software is moving into a
new era, in which the proper programming environment and modular program
systems [5] are replacing monstrously large programs.

In contrast to efforts based on high performance sixth generation
computers [2,6], this paper describes ongoing efforts [7] to investigate
the usefulness of so-called "attached processors" or "array processors",
combined with mini- and superminicomputers, in the analysis of
computationally demanding engineering problems, such as finite element
nonlinear applications. The work presented here does not include actual
solutions of such problems. Rather, a close examination of some basic
algorithms, which play an important part in nonlinear analysis, is
presented. Furthermore, a methodology is developed, which forms a useful
guide to further efforts of this nature, including the more general case
of distributed processing. Emphasis is placed on the ability to predict
the performance of algorithms from time measurements of certain basic
operations, coupled with an understanding of the characteristics of the
hardware in question and the interplay between its various components.

The array processor/minicomputer combination is a special case of a
multiprocessor system in which only two processors, with widely differing
characteristics, are present. Experimentation with such a system provides
an insight into the more general area of parallel processing. The issues
here relate to the synchronization and balancing of computations and data
transfers, as well as the design of operating systems and applications
programs to handle the solution of complex problems [8,9,10].

Measurements taken on an array processor/minicomputer system show
that, while the array processor is perhaps 80 times faster than the host
16 bit minicomputer in the performance of basic matrix arithmetic, it is
often not possible to apply this computational power to practical
problems, due to the various restrictions imposed by the limited user
address space. An alternate system is now being put together in which a
32 bit minicomputer, with a large address space and a faster data
transfer rate, is used as the host.

Several algorithms typical of finite element analysis were chosen as
a basis for performance measurements. Some experiments measured the
speeds of execution on the host computer and the array processor acting
as separate processors. Other measurements were designed to assess the
[...] problems may appear competitive, one should not forget that severe
restrictions are placed on problem size due to address limitations,
rendering these figures purely hypothetical, except in some dedicated
systems.


NOMENCLATURE 

It is useful to introduce some symbols and abbreviations at this
point, in order to describe alternate computer configurations and refer
to problem parameters. These abbreviations are:

HC16     Refers to the 16-bit minicomputer used independently or in
         conjunction with an array processor. It is characterized by a
         limited user address space (typically 64 KB) and a slow data
         bus (typical speed up to 1.5 MB/sec).

HC32     Alternate 32-bit superminicomputer used independently or with
         an array processor. Since the array processor is assumed to
         provide the primary computational tool, the selected HC32
         system has the advantages of a 32-bit minicomputer: a large
         address space (16 MB) and a fast data bus (speed
         2?.5 MB/sec). On the other hand, its floating point
         processing speed is somewhat limited (0.435 MFLOPS in
         Whetstone tests, as opposed to the HC16 figure of 0.190).
         Today's popular superminis have been measured at 1.2 MFLOPS,
         and the top of the line of our HC32 series has been measured
         at 3.5 MFLOPS.

AP       Stands for array processor. The one used in this research is
         primarily a double precision device, capable of performing
         64-bit vector arithmetic, and other well structured
         computations, at speeds well above those of mini- and
         superminicomputers. So far, however, it has been used in the
         single precision (32-bit) mode, due to address space
         limitations. Single precision variants of the same array
         processor perform the same functions faster and are
         considerably less expensive.

HC16+AP  Combination of the HC16 and AP. The devices communicate via a
         slow interface, using the HC16 low speed data bus
         extensively.

HC32+AP  Combination of HC32 and AP, which are coupled via a high
         speed direct memory interface.

U        Total number of unknowns in the problem being solved.

b        Half bandwidth of problem. Based on element connectivity.

S        Submatrix size used to partition the system of equations.

T        Total number of partitions in the stiffness matrix.

H        Number of non-zero submatrices, per row, in stiffness matrix.

M        Number of longitudinal stations in the solid model being
         analyzed.

N        Number of nodes across the solid model in either direction.

K        Total number of ?-noded isoparametric solid elements used to
         model the solids problem.
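
The partition symbols can be tied together in a few lines. The sketch below is illustrative only: it assumes equal-sized partitions, so that U = T x S, and a half bandwidth covered by whole submatrices, so that H = ceil(b / S). Neither relation is stated explicitly in the nomenclature, although both agree with the first row of Table 4 (T = 40, S = 30, b = 270). The class name Partitioning is hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Partitioning:
    """Hypothetical helper relating the symbols T, S, b, U and H."""
    T: int  # total number of partitions
    S: int  # submatrix size (each submatrix is S x S)
    b: int  # half bandwidth of the problem

    @property
    def U(self) -> int:
        # Assumption: equal-sized partitions, so U = T * S.
        return self.T * self.S

    @property
    def H(self) -> int:
        # Assumption: the half bandwidth spans whole submatrices.
        return math.ceil(self.b / self.S)

    @property
    def band_ratio(self) -> float:
        # The b/U column of Table 4.
        return self.b / self.U


p = Partitioning(T=40, S=30, b=270)
print(p.U, p.H, round(p.band_ratio, 3))
```
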


HARDWARE CHARACTERISTICS

An array processor, also called an attached processor, is a high
speed special purpose computational device, which can perform repetitive
computations on well structured data sets at effective speeds far beyond
those achieved by minis and superminis. Although their speeds are
substantially below the newest generation of large scale pipeline and
parallel vector computers, they do provide a clear cost/performance
advantage. Furthermore, the combination of a minicomputer and an array
processor offers a high degree of interactivity in addition to attractive
computational speeds.


The array processor is a separate computer, running under an [...]
involving extensive decision making. Data transfers and message swapping
between the host and the array processor are problematic and can affect
performance appreciably, as seen from some of the measurements taken. The
amount of AP local memory may be limited in certain instances. This,
however, does not apply to the system under consideration.

Several AP types are available today. We restrict the discussion to
the specific device used in this research which, we believe, represents
the state of the art. To clarify the operation of the system, Fig. 1
shows the two basic components and the various software items which
control them, as well as the location of the data. The operating system
of the host computer controls the operation of the host, including the
user program, which specifies the sequence of events necessary to
perform the computation. A special piece of software, called the DRIVER,
resides in the host computer. It facilitates communication with the
array processor by translating user FORTRAN calls into small packages
called Function Control Blocks, or FCBs, which are shipped to the array
processor to initiate specific computations or data transfers on the
latter device. The format of a typical FCB, which occupies six 16-bit
words (HC16) or three 32-bit words (HC32), is given in Fig. 2. It
contains a function code and operand addresses, which take the form of
array processor buffer numbers, as well as some control variables
(checksum, etc.).

On the array processor side, an independent operating system, called
the EXEC, resides. Its function is to control the operation of the array
processor itself. In the array processor there exists a mathematical
library, consisting of a number of unlinked (relocatable) routines, which
may be assembled into any program being executed inside the array
processor. One of the functions of EXEC is to link such programs, as
needed, to the specifications contained in the FCBs, and control their
execution within the array processor. Such programs take the form of two
separate, but fully synchronized, programs called the APU and APS
programs respectively. Data are stored within the array processor, while
being operated upon, in special areas called buffers. One important
consideration for the system at hand is that, in the FCB, a restricted
number of bits is used to store the buffer number. This poses a
restriction on the number of buffers which may be defined. At the moment,
the maximum number of buffers is 64. Soon it will be expanded to 512. The
hardware configurations for the HC16+AP and the HC32+AP are given in
Fig. 3 and Fig. 4 respectively.
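
To make the FCB idea and the buffer limit concrete, the following is a minimal sketch of how a six-word control block might be built and packed. The actual field layout is defined in Fig. 2 and is not reproduced in the text, so everything about the layout here is an assumption made for illustration: the word order, the limit of three operand buffers, the single control word, and the checksum rule are all hypothetical, as are the names FCB, MAX_BUFFERS and WORDS_PER_FCB. Only the six 16-bit words and the limit of 64 buffers come from the text.

```python
import struct
from dataclasses import dataclass

MAX_BUFFERS = 64    # current limit quoted in the text
WORDS_PER_FCB = 6   # six 16-bit words on the 16-bit host


@dataclass
class FCB:
    """Hypothetical Function Control Block; the real layout is in Fig. 2."""
    function_code: int   # which AP routine or transfer to start
    buffers: tuple       # operand buffer numbers (at most three here)
    control: int = 0     # stand-in for the control variables

    def words(self):
        if len(self.buffers) > 3:
            raise ValueError("this sketch allows at most three operands")
        for number in self.buffers:
            if not 0 <= number < MAX_BUFFERS:
                raise ValueError(f"buffer {number} exceeds the buffer limit")
        operands = list(self.buffers) + [0] * (3 - len(self.buffers))
        body = [self.function_code, *operands, self.control]
        checksum = sum(body) & 0xFFFF   # invented checksum rule
        return body + [checksum]

    def pack(self) -> bytes:
        # Six unsigned 16-bit words, little-endian (byte order is assumed).
        return struct.pack("<6H", *self.words())


fcb = FCB(function_code=7, buffers=(1, 2, 3))
print(fcb.words(), len(fcb.pack()))

# Bits needed to address every buffer: 6 for 64 buffers, 9 for 512.
print((64 - 1).bit_length(), (512 - 1).bit_length())
```

The last line shows why the buffer count is tied to the FCB format: raising the limit from 64 to 512 buffers requires three more bits in each buffer-number field.
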

BASIC MEASUREMENTS

In attempting to assess the effectiveness of a particular hardware
system, such as the one being considered here, several approaches are
possible. One approach is to run specific problems on an existing system.
[...] for each basic task from which performance estimates may be made.
It is possible then to run large numbers of parametric studies, in order
to assess the effect of different problem parameters, as well as the
effectiveness of the hardware for certain algorithms. These overall
estimates are then verified by actual computations of realistic problems.
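
The estimating idea can be sketched as a weighted sum: multiply the number of times each basic task is performed by its measured unit time, and add the results. The snippet below is a minimal illustration, not the program used in the study. The unit times are the AP column of Table 1 (40X40 submatrices, in milliseconds); the operation counts and the function name estimate_ms are hypothetical.

```python
# Unit times in milliseconds for 40X40 submatrices on the AP (Table 1).
AP_UNIT_MS = {
    "multiplication": 73.58,
    "addition": 10.00,
    "inversion": 112.34,
}


def estimate_ms(counts, unit_ms):
    """Sum count * unit time over the basic operations."""
    return sum(n * unit_ms[op] for op, n in counts.items())


# Hypothetical operation counts, chosen only for illustration.
counts = {"multiplication": 10, "addition": 5, "inversion": 1}
total = estimate_ms(counts, AP_UNIT_MS)
print(f"estimated time: {total:.2f} ms")
```

A parametric study then amounts to recomputing the counts for each combination of problem parameters and calling the same function.
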

Four systems have been considered: HC16, HC32, HC16+AP and HC32+AP.
So far the performance of the HC16 and HC16+AP have been measured. The
performance figures for the other two systems have been estimated, since
they have not been fully installed yet. It is important to note that both
host CPU and elapsed times, associated with each operation, have been
measured. However, only elapsed times are presented. By doing this, it is
implied that performance is measured by the clock time needed to execute
the algorithm rather than by the CPU time, which can be deceptive. It is
also assumed that a system, such as the one examined here, is intended
primarily as a stand alone high performance system, even though it may
well be capable of time sharing and multiprocessing.
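
The difference between elapsed and CPU time is easy to demonstrate on any modern machine. The sketch below is an illustration in Python rather than anything used in the study; the function wait_for_device is a hypothetical stand-in for a host that sits idle while an attached device or a data transfer is busy.

```python
import time


def timed(fn, *args):
    """Return (result, elapsed_seconds, cpu_seconds) for one call."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = fn(*args)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return result, wall, cpu


def wait_for_device(seconds):
    # Stand-in for a host that idles while another device works.
    time.sleep(seconds)
    return "done"


_, wall, cpu = timed(wait_for_device, 0.2)
print(f"elapsed {wall:.2f} s, CPU {cpu:.2f} s")
```

The host CPU time is close to zero even though the user waits for the full elapsed time, which is why CPU time alone can give a deceptive picture of an attached-processor system.
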

Another issue to consider is that all measurements and storage space
considerations are for single precision (32-bit) computations, even
though the array processor is a double precision system, which also has
single precision capabilities. We expect to extend the measurements later
to double precision. Many of the conclusions reached here, however, may
be generalized or extended, using reasonable assumptions. Changing to
double precision will impact primarily the HC16, since the address space
is quite limited. Execution speeds will also affect the host processor
more than the AP, since the AP is, basically, a double precision device.
It is also important to note that if single precision were adequate, less
expensive array processors are available, which would be much faster.

The first type of operation of significance is that of FCB transfers
[...]

STIFFNESS COMPUTATION AND ASSEMBLY


The process of stiffness assembly is a basic and important one for
both linear and nonlinear analysis. Here, two basic functions are
performed: element stiffness computation and assembly into the master
stiffness matrix. For the purpose of this investigation, the solid model
shown in Fig. 8 was chosen. By varying the number of stations (M) and the
number of nodes in each direction of the cross-section (N), different
combinations of problem size and bandwidth are produced, which form the
basis for a parametric study. One important assumption made was that the
core memory must be large enough to contain the shaded area of the
stiffness matrix, shown in Fig. 8. In this way, the submatrices of the
stiffness hypermatrix are generated in core and written to the disc only
once.

At this point, it is appropriate to bring up the issue of problem
size limitations, which apply here equally for stiffness assembly and
decomposition. Fig. 9 illustrates this point. Several factors limit
problem size, and these constraints are represented by straight lines or
curves in the figure. For example, the effect of the limited number of
buffers is shown by vertical lines. At present, the system has a [...] to
512, allowing ?1 submatrices per row. The size of memory 1 in the AP is
currently equal to 16 KW, minus 11 KW needed for EXEC and the
mathematical library. Since in the current decomposition routine
submatrices are placed in this memory, a submatrix size of 30X30 can not
be exceeded. If the memory is expanded to 32 KW, it is possible to employ
62X62 matrices; and if extended to 64 KW, matrices over 100X100 may be
handled. With the total space available on memories 2 and 3, a maximum
half bandwidth b, of approximately ?20, can be handled. This is
illustrated by the first set of curves showing the possible combinations
of S and H which may be handled. If the two memories are expanded to
their full capacity, b may be increased to [...] during the assembly
process. The time required for these transfers was also computed and
included in the estimate. Results from various runs show that the
stiffness computation process is speeded up by a constant factor of
approximately 5 for the HC16, and 3.5 for HC32, regardless of bandwidth,
by adding the array processor. This result is somewhat disappointing
perhaps, but similar conclusions have been reached by others [2]. Table 2
gives the operations count and timing for a realistic (20X12X12) problem
where the number of unknowns is [...] and the half bandwidth is 900.
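
The speed up factors quoted for stiffness assembly can be reproduced directly from the entries of Table 2. The snippet below is a small check written for this purpose, with hypothetical names (TABLE2, total, speed_up); it assumes that the total time is the plain sum of transfers and element computation, with no overlap.

```python
# Times in seconds from Table 2 (stiffness assembly).
TABLE2 = {
    "HC16":    {"transfers": 237.7, "element_comp": 7687.9},
    "HC32":    {"transfers": 35.7,  "element_comp": 3843.9},
    "HC16+AP": {"transfers": 237.7, "element_comp": 1200.1},
    "HC32+AP": {"transfers": 35.7,  "element_comp": 1078.2},
}


def total(system):
    """Serial total: transfers plus element computation."""
    return sum(TABLE2[system].values())


def speed_up(host, combination):
    """Ratio of host-alone time to host-plus-AP time."""
    return total(host) / total(combination)


print(round(speed_up("HC16", "HC16+AP"), 3))
print(round(speed_up("HC32", "HC32+AP"), 3))
```

Because the transfer time is unchanged by the array processor, it caps the gain: the element computation alone speeds up by more than the total does.
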
[...] and compared with the results obtained by the [...] program. The
results are given in Table 4 and Fig. 10. Table 4 reflects several
interesting points. Due to discrepancies between the decomposition
program and the optimum algorithm, some differences in the counts were
observed. In order to remove the effect of the discrepancy, a correction
was applied, and both the corrected and uncorrected figures are given. In
both cases, the errors are well within the accepted limits, particularly
when the run times can, themselves, vary to the same order of magnitude.
This is illustrated by one of the cases, which was run twice and shows a
difference of approximately 10 percent. Estimates, on the whole, seem to
be more accurate for narrow banded matrices, which is not surprising
since this was one of the basic assumptions made in the operation count
estimates. The adjusted run times are more accurate than the unadjusted
ones. Fig. 10 gives a pictorial representation of the results for a set
of decompositions where the bandwidth is kept at a constant value of 270.
The results give a certain amount of confidence in the basic approach
used in this paper and allow us to extend the estimates to other
algorithms.
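
The percent error column of Table 4 appears to be the difference between the measured and estimated times, taken relative to the measured time. That definition is an inference from the legible rows of the table rather than something the text states, so the sketch below should be read as a consistency check only. The six (estimated, measured) pairs are the rows for U = 450 (two runs), 360, 300, 240 and 150.

```python
def percent_error(estimated, measured):
    """Assumed definition: |measured - estimated| relative to measured."""
    return abs(measured - estimated) / measured * 100.0


# (estimated, measured) times in seconds from the legible rows of Table 4.
ROWS = [(63, 82), (63, 74), (49, 51), (42, 45), (38, 34), (33, 31)]

errors = [round(percent_error(e, m)) for e, m in ROWS]
print(errors)
```

The first two pairs are the same case run twice, which shows how much of the apparent error can come from run-to-run variation in the measurement itself.
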


FORWARD AND BACKWARD SUBSTITUTIONS

[...] simple additions inside the array processor. In that case, it might
be advantageous to perform this part of the computation inside the host
computer. This would speed up the process somewhat, giving a total speed
up factor of a little more than 5 over the HC32 alone. At this point in
time, it appears that the advantage of the array processor is not as
substantial here. However, careful measurements are required before final
conclusions are made.


ON  THE  ADVANTAGES  OF  PARALLEL  COMPUTATION 


In a system incorporating multiple asynchronous devices, the question
comes to mind as to the possible advantages of parallel computation.
Whether an algorithm can be accelerated by allowing the different
hardware modules to operate independently, performing different but
related functions at the same time, depends on both the algorithm and the
hardware configuration [7]. Fig. 11 shows a hypothetical case for which
parallel processing is to be applied to a specific algorithm. Three
devices are considered: a host computer, a data transfer bus, and an
array processor. Fig. 11.a illustrates the sequence of operations, in
real time, for the given system, assuming serial computation. The same
process may be represented more clearly in Fig. 11.b, where it is evident
that the array processor is not being used sufficiently. The host and
data bus, on the other hand, are used approximately half the time. The
maximum benefit to be derived from parallelism, in this case, is shown in
Fig. 11.c, which assumes that all devices can perform the required
functions simultaneously. This is usually not possible, however, since
operations conducted in one part of the hardware usually require that
other operations conducted elsewhere be completed first, which results in
an actual run such as that shown in Fig. 11.d.
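
The three situations described above (serial execution, ideal overlap, and an actual run constrained by dependencies) can be compared with a very small simulation. The sketch below is not the model behind Fig. 11; it assumes that the work is split into identical blocks, that each block must pass through the host, the data bus and the array processor in that order, and that each device handles one block at a time. The stage times are hypothetical.

```python
def serial_time(stage_times, n_blocks):
    """One device active at a time: every stage of every block in sequence."""
    return n_blocks * sum(stage_times)


def ideal_time(stage_times, n_blocks):
    """Perfect overlap: limited only by the busiest device."""
    return n_blocks * max(stage_times)


def pipelined_time(stage_times, n_blocks):
    """Overlap subject to dependencies: a block enters a stage only after it
    has left the previous stage and the stage has finished the previous block."""
    finish = [0.0] * len(stage_times)   # when each stage last became free
    for _ in range(n_blocks):
        ready = 0.0                     # when this block leaves the prior stage
        for j, t in enumerate(stage_times):
            start = max(ready, finish[j])
            finish[j] = start + t
            ready = finish[j]
    return finish[-1]


# Hypothetical times per block for (host, data bus, array processor).
stages = (2.0, 2.0, 1.0)
print(serial_time(stages, 4), pipelined_time(stages, 4), ideal_time(stages, 4))
```

With these numbers the dependency-constrained run falls between the serial and the ideal cases, which is the qualitative behaviour the four parts of Fig. 11 are described as showing.
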

Whether a particular algorithm can benefit from parallel computation
or not must first be determined by a simple analysis of the demand on
computer resources. If the largest amount of time is spent in one
particular device, parallel operation will most certainly not help. It
might be possible to shift some of the load from one device to another,
in order to obtain a more balanced load on the system. If, for example,
an algorithm is compute bound, such as in the case of the decomposition,
it is possible to increase the speed by supplying more than one array
processor, or CPU, in order to achieve a balance between computation and
data transfer times. On the other hand, if the process suffers from
excessive data transfer times, data buffering and faster data busses,
along with multiple disk drives and controllers, may be used. Careful
planning of system I/O operation would also help. Once it is determined
that an algorithm is potentially suitable for parallel computation,
detailed analysis in which the discrete nature of the demands on each
system component is taken into consideration, is worth undertaking [7].
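
The simple resource analysis suggested above can be sketched as follows. For each algorithm the time spent on each device is totalled; the ratio of the serial time to the time on the busiest device is then an upper bound on what overlapping can gain. The figures are the HC32+AP entries of Tables 3 and 5. Grouping data transfers with FCB transfers as "bus" time, and the three arithmetic operations as "AP" time, is an assumption made here for illustration, as is the function name overlap_gain.

```python
def overlap_gain(device_times):
    """Return the busiest device and the bound serial_time / busiest_time."""
    serial = sum(device_times.values())
    busiest = max(device_times, key=device_times.get)
    return busiest, serial / device_times[busiest]


# HC32+AP times in seconds, grouped by device (grouping is an assumption).
decomposition = {"bus": 71.5 + 61.9, "AP": 7144.0 + 143.3 + 51.9}
substitution = {"bus": 140.5 + 49.0, "AP": 64.9 + 83.3}

for name, times in (("decomposition", decomposition),
                    ("substitution", substitution)):
    device, gain = overlap_gain(times)
    print(f"{name}: bound by {device}, overlap gain at most {gain:.2f}")
```

The decomposition is so heavily compute bound that overlapping transfers with computation can gain only about 2 percent, whereas the substitution phase, where bus and AP times are closer to balance, could gain considerably more.
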

CONCLUSIONS

The paper has succeeded in devising a methodology for predicting the
performance improvement as a result of the addition of an array processor
to a minicomputer. Four systems were considered, two of which have
attached array processors. Of the algorithms considered, it appears that
the decomposition process is the one which benefits most from the
presence of an array processor. The speed advantage gained in stiffness
assembly, and forward and backward substitutions, may both be obtained by
purchasing a more powerful superminicomputer. As such, it appears that a
mix of jobs, in which the host computer is freed to handle other tasks
such as pre- and postprocessing along with linearized iterations while
the array processor is occupied in the decomposition process, might be
advantageous. One might think of alternatives to traditional
computational strategies. For example, in the modified Newton method, the
host computer may be used to improve on a current solution vector based
on a previously obtained decomposition, while the array processor is used
to compute and decompose a new stiffness matrix based on a more recent
solution point. Using the current time estimates, the advantage gained
may be small. Only 6 iterations can be performed in the time required by
the HC32+AP to compute a new stiffness and decompose it. Sparse matrix
computations may be used for the iterative procedure, rather than a
direct solution. It is too early to find out whether such algorithms
could be effective on an array processor.
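
The iteration count for the modified Newton scheme can be estimated from the table totals. The sketch below assumes, purely for illustration, that one host iteration costs one forward and backward substitution pass on the HC32 alone (Table 5 total) and that the array processor needs the HC32+AP totals of Tables 2 and 3 to form and decompose a new stiffness matrix. The paper does not spell out this calculation, and the variable names are hypothetical.

```python
# Totals in seconds taken from the tables.
STIFFNESS_HC32_AP = 1114.0       # Table 2, HC32+AP total
DECOMPOSITION_HC32_AP = 7472.7   # Table 3, HC32+AP total
SUBSTITUTION_HC32 = 1438.2       # Table 5, HC32 total (host alone)

# Time during which the AP is busy with a new stiffness and decomposition.
window = STIFFNESS_HC32_AP + DECOMPOSITION_HC32_AP

# Host iterations that fit in that window (assumed cost: one substitution).
iterations = window / SUBSTITUTION_HC32
print(f"about {iterations:.2f} host iterations per new decomposition")
```

Under these assumptions the host completes roughly six substitution passes per new decomposition, so the overlap pays off only if that many iterations on the old decomposition are actually useful.
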




One must state, however, that the apparent advantage of the array
processor in the decomposition algorithm, which seems to be the most time
consuming of all computational steps, is not to be taken lightly. It is
clear that such speeds can not be achieved with simple computer designs
within the same price range.

Many other conclusions come to mind. A 16-bit minicomputer with
limited address space does not appear to be suitable as a host for a fast
array processor. The restricted memory size drastically limits problem
size. A conventional communication interface is not adequate if the
minicomputer/array processor combination is to be utilized effectively.
In using the hypermatrix scheme, small submatrices are to be avoided. A
minimum size of ?0X30 is recommended. For large problems [...]
computation and assembly [...] speeded up by a [...], but [...] may run
as much as 40 times faster. Parallel processing does not seem to offer
much of an advantage, unless multiple array processors are used or a mix
of jobs is carried on simultaneously by the system. The manufacturing of
specialized hardware to perform certain functions effectively is a valid
idea and will play an increasingly greater role. Future hardware will
probably take the form of a network of such systems.
ITEM                  AP        HC16        HC32

HC-DK                  -       55.33        9.40
HC-AP                  -       45.87       16.10
Multiplication      73.58    5448.40     2724.20
Addition            10.00      65.50       32.75
Inversion          112.34    8039.34     4019.67
MRB transfer           -        8.50        1.00
FCB transfer           -        8.50        1.00

Table 1. TIMING OF BASIC OPERATIONS FOR 40X40 SUBMATRICES
(Times in msecs.)


ITEM             HC16     HC32    HC16+AP  SPEED UP  HC32+AP  SPEED UP
                                            FACTOR             FACTOR

Transfers        237.7     35.7     237.7    1.000      35.7    1.000
Element Comp.   7687.9   3843.9    1200.1    6.406    1078.2    3.565
Total           7925.6   3879.7    1437.8    5.512    1114.0    3.483
Parallel        7687.9   3843.9    1200.1    6.406    1078.2    3.565

NOTE: HC16 times are hypothetical due to memory restrictions.

Table 2. RELATIVE PERFORMANCE FOR TYPICAL SOLID PROBLEM
(STIFFNESS ASSEMBLY) (Times in secs.)


ITEM               HC16       HC32    HC16+AP  SPEED UP  HC32+AP  SPEED UP
                                                FACTOR             FACTOR

Transfers          475.5       71.5     838.6    0.567      71.5    1.000
Multiplication  555952.1   277976.0    7144.0   77.821    7144.0   38.910
Addition          2085.4     1042.7     143.3   14.550     143.3    7.275
Inversion         3906.4     1953.2      51.9   75.271      51.9   37.635
FCB transfer          -          -      526.4       -       61.9       -
Total           562419.3   281043.3    8704.3   64.614    7472.7   37.609
Parallel        561943.8   280971.8    7339.2   76.567    7339.2   38.283

Table 3. RELATIVE PERFORMANCE FOR TYPICAL SOLID PROBLEM
(DECOMPOSITION) (Times in secs.)


No. of  Submatrix  N.S.M./     U     b/U   Estimated  Measured  Pcnt   Adj.  Pcnt
Prtns.  size S     row H                   (prog.)    (run)     err          err

  40       30         9     1200   0.225     2?7        237      16    2??     7
  30       30         9      900   0.3       136        179      13    163     5
  20       30         9      600   0.45       94        119      21    1?1    15
  15       30         9      450   0.60       63         82      23     -      -
  15       30         9      450   0.60       63         74      15     -      -
  15       24         9      360   0.60       49         51       4     -      -
  15       20         9      300   0.60       42         45       7     -      -
  15       16         9      240   0.60       38         34      12     -      -
  15       10         9      150   0.60       33         31       6     -      -

Table 4. COMPARISON OF MEASURED AND PREDICTED TIMES
(MATRIX DECOMPOSITION) (Times in secs.)


ITEM             HC16     HC32    HC16+AP  SPEED UP  HC32+AP  SPEED UP
                                            FACTOR             FACTOR

Transfers        934.3    140.5    1647.8     .567     140.5    1.000
Multns.         2564.6   1282.3      64.9   39.540      64.9   19.770
Addition          30.9     15.5      83.3     .371      83.3     .185
Inversion           .0       .0        .0     .000        .0     .000
FCB transfer        -        -      416.7       -       49.0       -
TOTAL           3529.8   1438.2    2212.7    1.595     337.7    4.259
PARALLEL        2595.5   1297.8    2064.5    1.257     189.5    6.848

Table 5. RELATIVE PERFORMANCE FOR TYPICAL SOLID PROBLEM
(FORWARD AND BACKWARD SUBSTITUTIONS) (Times in secs.)


FIGURE CAPTIONS

1    Schematic of Control and Data Flow between H.C. and A.P.
2    FCB Format
3    System with 16 Bit Host
4    System with 32 Bit Host
5.a  Data Transfer Speed (HC16-AP)
5.b  Data Transfer Speed (HC16-DK)
6    Array Processor Performance (Matrix Multiplication Time)
7    Array Processor Performance (Matrix Inverse)
8    Solid Model Example
9    Problem Size Limitation (Single Precision)
10   Matrix Decomposition on HC16+AP, Estimates vs. Measurements
11   Serial and Parallel Processing